This is a basic analysis of the top 100 songs on Spotify for the year 2017. The audio features for each song were extracted using the Spotify Web API and the spotipy Python library. Credit goes to Spotify for calculating the audio feature values. This dataset is publicly available on Kaggle.
We will only look at a few columns that are of interest to us.
Library imports:
library(tidyverse)
library(knitr)
Data import:
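# Read `mode` as character so its values can be recoded into labels.
df <- read_csv("spotify-2017.csv",
               col_types = cols(mode = col_character()))
# Recode the numeric mode flags as a factor: 1.0 is major, 0.0 is minor.
df <- df %>% mutate(mode = fct_recode(mode,
                                      "Major" = "1.0",
                                      "Minor" = "0.0"))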
For this analysis, we will focus on mode, tempo, valence and loudness. Below are the details for these columns. For details on the remainder of the columns, see here.
name
: Name of the song
artists
: Artist(s) of the song
loudness
: The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
mode
: Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived.
valence
: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
tempo
: The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
Here is a glimpse of the dataset:
df <- df %>% select(name, artists, loudness, mode, valence, tempo)
kable(head(df))
name | artists | loudness | mode | valence | tempo |
---|---|---|---|---|---|
Shape of You | Ed Sheeran | -3.183 | Minor | 0.931 | 95.977 |
Despacito - Remix | Luis Fonsi | -4.328 | Major | 0.813 | 88.931 |
Despacito (Featuring Daddy Yankee) | Luis Fonsi | -4.757 | Major | 0.846 | 177.833 |
Something Just Like This | The Chainsmokers | -6.769 | Minor | 0.446 | 103.019 |
I’m the One | DJ Khaled | -4.284 | Major | 0.811 | 80.924 |
HUMBLE. | Kendrick Lamar | -6.842 | Minor | 0.400 | 150.020 |
First, we want to test if there is a difference in the distribution of tempo between songs in a major key and songs in a minor key. Let’s look at this in a histogram:
ggplot(data = df, mapping = aes(x = tempo)) +
  geom_histogram(aes(fill = mode)) +
  facet_wrap(~ mode)
The distributions in both plots look quite similar, with a large peak around 100 BPM and maybe a smaller peak around 130-150 BPM.
We can plot both these distributions as a density plot:
ggplot(data = df, mapping = aes(x = tempo)) +
  geom_density(aes(col = mode))
The two distributions look very similar.
Let’s compute the mean tempo for each of the modes:
df %>%
  group_by(mode) %>%
  summarize(mean_tempo = mean(tempo))
## # A tibble: 2 x 2
## mode mean_tempo
## <fct> <dbl>
## 1 Minor 116.
## 2 Major 122.
Test whether the difference in mean tempo between the two modes is significant with the \(t\)-test:
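# Extract the tempo values for each mode as plain numeric vectors.
major_data <- df %>% filter(mode == "Major") %>% pull(tempo)
minor_data <- df %>% filter(mode == "Minor") %>% pull(tempo)
# Two-sided Welch t-test (t.test() does not assume equal variances by default).
t.test(major_data, minor_data, alternative = "two.sided")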
##
## Welch Two Sample t-test
##
## data: major_data and minor_data
## t = 1.0377, df = 97.475, p-value = 0.302
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -5.152959 16.446581
## sample estimates:
## mean of x mean of y
## 121.5741 115.9273
The \(p\)-value for this test is around 0.30, so we do not reject the null hypothesis that the mean tempo is the same for both modes.
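To make the output of `t.test` less of a black box, here is a minimal sketch that recomputes the Welch statistic, its Welch-Satterthwaite degrees of freedom and the two-sided p-value by hand. The data are simulated (the group sizes, means and standard deviations are assumptions, not the Spotify values), and `welch_t`, `major_sim` and `minor_sim` are hypothetical names introduced only for this illustration.

```r
set.seed(42)
# Simulated tempos: hypothetical stand-ins, NOT the Spotify data.
major_sim <- rnorm(60, mean = 122, sd = 28)
minor_sim <- rnorm(40, mean = 116, sd = 26)

# Welch two-sample t-test computed from first principles.
welch_t <- function(x, y) {
  vx <- var(x) / length(x)  # squared standard error of mean(x)
  vy <- var(y) / length(y)  # squared standard error of mean(y)
  t_stat <- (mean(x) - mean(y)) / sqrt(vx + vy)
  # Welch-Satterthwaite approximation to the degrees of freedom.
  dof <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))
  p_value <- 2 * pt(-abs(t_stat), df = dof)  # two-sided p-value
  c(t = t_stat, df = dof, p = p_value)
}

manual <- welch_t(major_sim, minor_sim)
builtin <- t.test(major_sim, minor_sim, alternative = "two.sided")

print(manual)
print(c(t = unname(builtin$statistic),
        df = unname(builtin$parameter),
        p = builtin$p.value))
```

The two printed vectors should agree, which also explains why the degrees of freedom in the output above (97.475) are not a whole number.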
Test if the distribution of tempo for songs in major key is significantly different from the distribution of tempo for songs in minor key with the Kolmogorov-Smirnov test:
ks.test(major_data, minor_data, alternative = "two.sided")
##
## Two-sample Kolmogorov-Smirnov test
##
## data: major_data and minor_data
## D = 0.13136, p-value = 0.7946
## alternative hypothesis: two-sided
The p-value for this test is around 0.80, so we don’t have enough evidence to reject the null hypothesis (i.e. the data we have could have reasonably come from the distribution under the null hypothesis).
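The statistic `D` reported by `ks.test` is the largest vertical gap between the two empirical distribution functions. The sketch below recomputes it by hand on simulated data (again hypothetical values, not the Spotify tempos; `x_sim`, `y_sim` and `d_manual` are names made up for this example).

```r
set.seed(1)
# Simulated tempos for two groups: hypothetical, NOT the Spotify data.
x_sim <- rnorm(60, mean = 122, sd = 28)
y_sim <- rnorm(40, mean = 116, sd = 26)

# Both ECDFs are step functions that only jump at observed values,
# so the largest gap is attained at one of the pooled data points.
grid <- sort(c(x_sim, y_sim))
d_manual <- max(abs(ecdf(x_sim)(grid) - ecdf(y_sim)(grid)))

d_builtin <- unname(ks.test(x_sim, y_sim, alternative = "two.sided")$statistic)

print(c(manual = d_manual, builtin = d_builtin))
```

Unlike the \(t\)-test, which only compares means, this statistic is sensitive to any difference in shape between the two distributions.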
Scatterplot of valence vs. loudness:
ggplot(data = df, mapping = aes(x = loudness, y = valence)) +
  geom_point()
Let’s fit a linear model of valence vs. loudness. Expectation: the louder the song, the happier it is. Hence, we expect a positive relationship.
lm(valence ~ loudness, data = df)
##
## Call:
## lm(formula = valence ~ loudness, data = df)
##
## Coefficients:
## (Intercept) loudness
## 0.79386 0.04897
Get more information on the linear fit with summary:
fit <- lm(valence ~ loudness, data = df)
summary(fit)
##
## Call:
## lm(formula = valence ~ loudness, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.39188 -0.12018 -0.00328 0.14860 0.40816
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.79386 0.06570 12.08 < 2e-16 ***
## loudness 0.04897 0.01108 4.42 2.55e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1986 on 98 degrees of freedom
## Multiple R-squared: 0.1662, Adjusted R-squared: 0.1577
## F-statistic: 19.54 on 1 and 98 DF, p-value: 2.548e-05
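From the summary, the slope for loudness is positive (about 0.049 per dB) and statistically significant (\(p \approx 2.5 \times 10^{-5}\)), although loudness explains only about 17% of the variance in valence (\(R^2 \approx 0.166\)).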
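In a regression with a single predictor, the slope's \(t\)-statistic, the correlation coefficient and \(R^2\) all carry the same information. As a quick sanity check, the sketch below recovers the \(t\) value and p-value for loudness using only the `Multiple R-squared` and the residual degrees of freedom reported in the summary output above. The variable names are made up for this example.

```r
# Values copied from the summary() output above.
r_squared <- 0.1662    # Multiple R-squared
resid_df  <- 98        # residual degrees of freedom (n - 2)

# With one predictor, R-squared is the squared correlation.
# The slope is positive, so the correlation is the positive root.
r <- sqrt(r_squared)

# t-statistic for the slope, expressed through the correlation.
t_stat <- r * sqrt(resid_df) / sqrt(1 - r_squared)

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_stat), df = resid_df)

print(round(c(r = r, t = t_stat, F = t_stat^2), 2))
print(signif(p_value, 3))
```

Up to rounding, this reproduces the `t value` (4.42) and the `F-statistic` (19.54) in the summary, and shows that the correlation itself is moderate, at roughly 0.41.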
Plot the linear fit along with the scatterplot:
ggplot(data = df, mapping = aes(x = loudness, y = valence)) +
  geom_point() +
  geom_smooth(method = "lm")
Whether a song is in a major key or a minor key could affect the relationship between valence and loudness. Expectation: ???
ggplot(data = df, mapping = aes(x = loudness, y = valence, col = mode)) +
  geom_point() +
  facet_grid(. ~ mode)
First, let’s fit the additive model:
fit <- lm(valence ~ loudness + mode, data = df)
summary(fit)
##
## Call:
## lm(formula = valence ~ loudness + mode, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.3945 -0.1170 -0.0027 0.1498 0.4119
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.79121 0.06839 11.569 < 2e-16 ***
## loudness 0.04912 0.01118 4.394 2.85e-05 ***
## modeMajor 0.00605 0.04061 0.149 0.882
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1996 on 97 degrees of freedom
## Multiple R-squared: 0.1664, Adjusted R-squared: 0.1492
## F-statistic: 9.684 on 2 and 97 DF, p-value: 0.0001464
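In this model, the coefficient for mode is tiny (about 0.006) and far from significant (\(p \approx 0.88\)), so whether a song is in a major or minor key doesn’t seem to make a difference to valence once loudness is accounted for.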
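To see what the additive model implies in practice, the sketch below plugs the coefficients from the summary output above into the fitted equation for a song at -5 dB. The loudness value and the helper `predict_valence` are assumptions chosen only for illustration.

```r
# Coefficients copied from the additive model summary above.
b_intercept <- 0.79121
b_loudness  <- 0.04912
b_major     <- 0.00605   # shift applied when mode is Major

# Fitted valence under the additive model; is_major is 0 (Minor) or 1 (Major).
predict_valence <- function(loudness, is_major) {
  b_intercept + b_loudness * loudness + b_major * is_major
}

loudness_example <- -5   # hypothetical loudness in dB
pred_minor <- predict_valence(loudness_example, is_major = 0)
pred_major <- predict_valence(loudness_example, is_major = 1)

print(c(minor = pred_minor, major = pred_major,
        gap = pred_major - pred_minor))
```

Because the model is additive, the gap between the two predictions is the same 0.00605 at every loudness level, which is negligible on the 0 to 1 valence scale.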
Next, let’s fit the model with interactions:
fit <- lm(valence ~ loudness * mode, data = df)
summary(fit)
##
## Call:
## lm(formula = valence ~ loudness * mode, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.39179 -0.12289 -0.00013 0.14712 0.42106
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.82557 0.09463 8.724 8.16e-14 ***
## loudness 0.05541 0.01638 3.384 0.00104 **
## modeMajor -0.06058 0.13270 -0.457 0.64905
## loudness:modeMajor -0.01186 0.02248 -0.528 0.59898
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2004 on 96 degrees of freedom
## Multiple R-squared: 0.1688, Adjusted R-squared: 0.1429
## F-statistic: 6.501 on 3 and 96 DF, p-value: 0.0004735
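The interaction model effectively fits a separate line for each mode. Since Minor is the reference level, the `(Intercept)` and `loudness` rows describe the Minor line, and the `modeMajor` and `loudness:modeMajor` rows are the adjustments needed to get the Major line. A small sketch of that arithmetic, using the coefficients reported above (the variable names are made up for this example):

```r
# Coefficients copied from the interaction model summary above.
b_intercept   <- 0.82557
b_loudness    <- 0.05541
b_major       <- -0.06058   # intercept adjustment for Major
b_interaction <- -0.01186   # slope adjustment for Major

# One row per mode: the intercept and slope of its fitted line.
lines_by_mode <- data.frame(
  mode      = c("Minor", "Major"),
  intercept = c(b_intercept, b_intercept + b_major),
  slope     = c(b_loudness,  b_loudness + b_interaction)
)

print(lines_by_mode)
```

The two slopes differ by only about 0.012 per dB, and the interaction term is not significant (p around 0.60), so there is no evidence that mode changes the relationship between valence and loudness.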
We can also draw the linear regression fits with the scatterplot:
ggplot(data = df, mapping = aes(x = loudness, y = valence, col = mode)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ mode)
There is a slight difference in slope between the two fits, but they look basically the same. This is more obvious when both are drawn on the same plot:
ggplot(data = df, mapping = aes(x = loudness, y = valence, col = mode)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
Overall, we find no evidence that whether a song is in a major or minor key affects its tempo. Mode also does not seem to influence the relationship between valence (a song’s happiness index) and loudness. As expected, there is a positive relationship between valence and loudness.